Numba - CPU parallelisation

The choice of whether to use the parallel CPU target for any given algorithm depends on a number of factors.

This notebook illustrates how the 'size' and 'shape' of a dataset can be one such factor, using a simple algorithm equivalent to the default behaviour of sklearn.preprocessing.StandardScaler.

More details on numba's parallelisation features are given in the excellent numba docs.


In [1]:
import numpy as np
import pandas as pd
from numba import njit, prange
from pytest import approx
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import matplotlib
matplotlib.rc('figure', figsize=(10, 5))

A note regarding the following algorithms:

  • standard: uses the single-threaded CPU target
  • standard_parallel: uses the parallel CPU target with an explicit numba.prange parallel loop

In [2]:
@njit(parallel=False)
def standard(A):
    """
    Standardise data by removing the mean and scaling to unit variance,
    equivalent to sklearn StandardScaler.
    """
    n = A.shape[1]  # number of columns
    res = np.empty_like(A, dtype=np.float64)

    # Standardise each column independently: subtract its mean, divide by its std.
    for i in range(n):
        data_i = A[:, i]
        res[:, i] = (data_i - np.mean(data_i)) / np.std(data_i)

    return res

In [3]:
@njit(parallel=True)
def standard_parallel(A):
    """
    Standardise data by removing the mean and scaling to unit variance,
    equivalent to sklearn StandardScaler.
    
    Uses explicit parallel loop; may offer improved performance in some
    cases.
    """
    n = A.shape[1]  # number of columns
    res = np.empty_like(A, dtype=np.float64)

    # prange distributes the iterations of the column loop across CPU threads.
    for i in prange(n):
        data_i = A[:, i]
        res[:, i] = (data_i - np.mean(data_i)) / np.std(data_i)

    return res

We're going to use the Iris dataset (150 rows x 4 columns) for this test.


In [4]:
A = load_iris().data

In [5]:
expected = StandardScaler().fit_transform(A)

In [6]:
output = standard(A)

In [7]:
np.allclose(output, expected)


Out[7]:
True
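
As an aside, the pytest approx helper imported above expresses the same check with a configurable tolerance (it supports numpy arrays directly); a minimal sketch using the default relative tolerance:

    # Element-wise comparison against pytest's default relative tolerance.
    assert output == approx(expected)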

In [8]:
output_parallel = standard_parallel(A)

In [9]:
np.allclose(output_parallel, expected)


Out[9]:
True
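
To confirm that numba actually parallelised the prange loop (rather than silently falling back to serial execution), the compiled dispatcher can print a diagnostics report; a minimal sketch, assuming a reasonably recent numba version:

    # Level 1 gives a summary of which loops were fused/parallelised.
    # The function must have been compiled (i.e. called) at least once.
    standard_parallel.parallel_diagnostics(level=1)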

In [10]:
def highlight_min(s):
    """Return styles that highlight the minimum value of a Series in yellow."""
    is_min = s == s.min()
    return ['background-color: yellow' if v else '' for v in is_min]

Firstly, we're going to tile the data 'horizontally', keeping the same number of rows while adding successively more columns. For each shape we record the fastest run of each implementation (%timeit -o returns a result object whose .best attribute holds it; -q suppresses the usual printed output).
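
For reference, np.tile with a scalar repetition count repeats a 2-D array along its last axis, which is exactly the 'horizontal' growth used below; a quick sketch:

    np.tile(A, 3).shape  # (150, 12): rows unchanged, columns tripled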


In [11]:
res = []
multiples = range(1, 42, 5)  # tiling factors 1, 6, 11, ..., 41

for idx, i in enumerate(multiples):
    data = np.tile(A, i)
    
    o_1 = %timeit -o -q StandardScaler().fit_transform(data)
    o_2 = %timeit -o -q standard(data)
    o_3 = %timeit -o -q standard_parallel(data)
    
    res.append((data.shape[1], o_1.best, o_2.best, o_3.best))
    print('{0} of {1} complete {2}'.format(idx + 1, len(multiples), data.shape))


1 of 9 complete (150, 4)
2 of 9 complete (150, 24)
3 of 9 complete (150, 44)
4 of 9 complete (150, 64)
5 of 9 complete (150, 84)
6 of 9 complete (150, 104)
7 of 9 complete (150, 124)
8 of 9 complete (150, 144)
9 of 9 complete (150, 164)

In [12]:
df = pd.DataFrame(res, columns=['num_cols', 'sklearn', 'numba CPU', 'numba CPU parallel'])

In [13]:
df = df.set_index('num_cols')
df = df * 1000  # convert seconds to milliseconds

In [14]:
ax = df.plot()
ax.set_title('Standard scale data: 150 rows by n columns')
ax.set_xlabel('Number of columns')
ax.set_ylabel('Time (ms)')
plt.legend(prop={'size': 14})


Out[14]:
<matplotlib.legend.Legend at 0x225a08b5be0>

In [15]:
df.style.apply(highlight_min, axis=1)


Out[15]:
num_cols    sklearn (ms)    numba CPU (ms)    numba CPU parallel (ms)
4           0.121786        0.00475795        0.0643363
24          0.151202        0.0257044         0.0745132
44          0.168868        0.0467578         0.0838214
64          0.206255        0.0896142         0.0844222
84          0.200158        0.0960841         0.0875621
104         0.234245        0.11817           0.0931101
124         0.244757        0.137082          0.0958816
144         0.249212        0.161065          0.0994426
164         0.263761        0.180338          0.103951

In the above results, observe the crossing point beyond which the parallel CPU target is (and remains) the fastest strategy.

Observe also its relative insensitivity to the number of columns: the prange loop distributes columns across threads, so at these sizes the fixed overhead of the thread pool dominates while the per-column work is shared out.

Next, we're going to repeat the experiment, this time tiling the data 'vertically' so that we keep 4 columns and add successively more rows.
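
The double transpose used below is one way of tiling along the first axis; np.vstack is an equivalent (and arguably clearer) spelling:

    np.tile(A.T, 3).T.shape   # (450, 4): columns unchanged, rows tripled
    np.vstack([A] * 3).shape  # (450, 4): same result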


In [17]:
res = []

for idx, i in enumerate(multiples):
    data = np.tile(A.T, i).T  # stack i copies of A vertically: (150 * i, 4)
    o_1 = %timeit -o -q StandardScaler().fit_transform(data)
    o_2 = %timeit -o -q standard(data)
    o_3 = %timeit -o -q standard_parallel(data)
    
    res.append((data.shape[0], o_1.best, o_2.best, o_3.best))
    print('{0} of {1} complete {2}'.format(idx + 1, len(multiples), data.shape))


1 of 9 complete (150, 4)
2 of 9 complete (900, 4)
3 of 9 complete (1650, 4)
4 of 9 complete (2400, 4)
5 of 9 complete (3150, 4)
6 of 9 complete (3900, 4)
7 of 9 complete (4650, 4)
8 of 9 complete (5400, 4)
9 of 9 complete (6150, 4)

In [18]:
df = pd.DataFrame(res, columns=['num_rows', 'sklearn', 'numba CPU', 'numba CPU parallel'])

In [19]:
df = df.set_index('num_rows')
df = df * 1000  # convert seconds to milliseconds

In [20]:
ax = df.plot()
ax.set_title('Standard scale data: n rows by 4 columns')
ax.set_xlabel('Number of rows')
ax.set_ylabel('Time (ms)')
plt.legend(prop={'size': 14})


Out[20]:
<matplotlib.legend.Legend at 0x225a05ac4e0>

In [21]:
df.style.apply(highlight_min, axis=1)


Out[21]:
num_rows    sklearn (ms)    numba CPU (ms)    numba CPU parallel (ms)
150         0.123694        0.00479796        0.0644024
900         0.140786        0.0195348         0.0181729
1650        0.161948        0.0342706         0.0522027
2400        0.2159          0.0612522         0.0801998
3150        0.235414        0.0728591         0.146206
3900        0.315953        0.0993358         0.159655
4650        0.657485        0.170097          0.246642
5400        0.644944        0.18712           0.415349
6150        0.728555        0.212345          0.461296

In this case, observe that the parallel CPU target is almost never optimal.

Observe also its sensitivity to the number of rows: with only 4 columns, the prange loop has at most 4 chunks of work to hand out, so the threading overhead is hard to amortise no matter how long the columns grow.
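
When the parallel target underperforms like this, one knob worth experimenting with is the size of the worker thread pool, which recent numba versions expose via set_num_threads/get_num_threads; a sketch:

    from numba import get_num_threads, set_num_threads

    set_num_threads(2)        # cap the pool; less scheduling overhead for small workloads
    print(get_num_threads())  # 2
    standard_parallel(data)   # re-time with the reduced thread count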